Development of a Gene-Based Soybean-Origin Discrimination Method Using Allele-Specific Polymerase Chain Reaction

A low soybean self-sufficiency rate in South Korea has caused a high import dependence and considerable price variation between domestic and foreign soybeans, which has led to the false labeling of foreign soybeans as domestic. Conventional soybean origin discrimination methods cannot analyze single grains and rely on the presence or absence of several compounds or on concentration differences. This limits the origin discrimination of mixed samples, demonstrating the need for a method that analyzes individual grains. Therefore, we developed a method for origin discrimination using genetic analysis. The whole-genome sequencing data of the Williams 82 reference cultivar and 15 soybean varieties cultivated in South Korea were analyzed to identify dense variation blocks (dVBs) with a high single-nucleotide polymorphism density. PCR primers were prepared and validated for the insertion–deletion (InDel) sequences of the dVBs to discriminate each soybean variety. Our method effectively discriminated domestic and foreign soybean varieties and could help to prevent their false labeling.


Introduction
Soybean is an important food ingredient and a primary source of nutrients worldwide, with an abundance of isoflavones, carbohydrates, fats, and proteins. Soybean has been utilized mainly in the production of vegetable oil and as an ingredient in fermented foods, such as soybean paste, soy sauce, red pepper paste, fermented soybean paste, and fermented whole soybean. It has also been consumed in the form of processed foods, including soybean milk and soybean curd. Owing to its enrichment with the essential amino acid lysine, soybean is the main protein source in countries that consume rice as a staple food [1].
The soybean self-sufficiency rate in South Korea is ≤30%, resulting in a high import rate. The main producers of soybeans are the United States of America (U.S.), Brazil, Argentina, India, and China. South Korea primarily imports soybeans from the U.S., Canada, and China. The wholesale price of imported soybeans is considerably lower, at 25% of the price of domestic soybeans [2].
Owing to the substantial price difference between domestic and imported soybeans, sellers might falsely label the country of origin to earn unfair profits. In 2022, the National Agricultural Products Quality Management Service (NAQS) in South Korea announced that soybean curd ranked sixth and soybean ranked seventh among the 156 items violating the Act on Origin Labeling of Agricultural and Fishery Products [3]. The origin of soybean is currently discriminated based on the difference in inorganic compounds using energy dispersive X-ray fluorescence spectrometry (ED-XRF). Furthermore, the absorbance difference across organic compounds is determined using Fourier transform near-infrared spectroscopy (FT-NIRS) [4][5][6]. The conventional physicochemical method of analysis enables a markedly simple and rapid analysis that requires no pretreatment besides the grinding of samples. However, a single grain of soybean cannot be analyzed. Additionally, the minimum sample mass required for analysis is 5-50 g. These limitations could pose difficulties in the origin discrimination of mixed samples.
Foods 2023, 12, 4497. https://doi.org/10.3390/foods12244497 https://www.mdpi.com/journal/foods
Extensive research has been conducted to distinguish growing plant varieties through morphological classification. Experts in origin determination have traditionally relied on distinguishing varieties based on seed shape, gloss, and color. However, a challenge arises as results tend to vary depending on the evaluator's level of expertise and experience. To address this issue, the field has turned to DNA molecular marker technology for breed identification [7].
The most frequently used method in soybean variety analysis is the simple sequence repeat (SSR), or microsatellite, approach used to establish barcode systems. However, this method requires expensive equipment, such as a DNA sequencer or chip electrophoresis, and a large number of markers, making it difficult to adopt as a general laboratory method [8,9]. The second most used PCR method is the cleaved amplified polymorphic sequence (CAPS) approach based on single-nucleotide polymorphisms (SNPs). This method is time-consuming owing to the involvement of restriction enzymes in addition to PCR [10]. Recently, insertion-deletion (InDel) markers have achieved high reproducibility and are recognized as efficient molecular markers for distinguishing cultivars based on codominance [11]. In particular, these markers are attracting attention from researchers because they are relatively simple compared to other molecular markers.
Soybeans possess 20 chromosomes with a known euchromatic genome size of 705 Mb. These chromosomes are categorized into sparse variation blocks (sVBs), characterized by the absence of chromosomal mutation, and dense variation blocks (dVBs), which occur infrequently. Notably, studies have indicated the presence of dVBs within 100 kb of the chromosome, and that genetic recombination does not occur often during the breeding process compared to the linkage disequilibrium block of 90-574 kb in soybean varieties [12]. In a previous study, a method was proposed to differentiate between dVBs and sVBs, focusing on the identification of InDel markers within dVBs to develop soybean variety identification methods [13,14]. Some studies have proposed a DNA barcode method for discriminating 147 soybean varieties from the genomic DNA extracted from soybean leaves, using a selected set of InDel markers within dVBs. However, this method focuses on domestic soybean varieties and is limited in identifying the genetic diversity patterns of imported varieties [15].
Therefore, this study aimed to utilize genetic analysis to discriminate the varieties of soybean cultivated in South Korea. Unlike the physicochemical analysis techniques known to date, the gene-based analysis technology developed in this study can analyze each individual grain, thereby ensuring a high identification accuracy. It is therefore likely that a quantitative analysis will be able to identify the origin of mixed samples. The proposed origin discrimination table could help to prevent the false labeling of foreign soybean varieties as domestic ones, thereby reducing unethical price markups by sellers.

Sample Collection
The 16 soybean varieties used as standard samples (15 Korean varieties and 1 American variety) were obtained from the Rural Development Administration. A total of 1096 samples (630 domestic and 466 imported) were collected from soybean farms, ports of entry, and large distributors between 2019 and 2021. The collected samples were used to construct the origin judgment value database. Furthermore, 60 soybean samples (30 domestic and 30 imported) were collected in 2022 to validate the origin discrimination table. The 30 domestic soybeans were obtained from 27 farms and local food markets nationwide. In addition, the imported soybeans included 11, 10, 4, 3, and 2 from the U.S., China, Canada, Thailand, and Vietnam, respectively. The collected samples were stored in a −20 °C freezer.

Genomic DNA Extraction
For DNA extraction from the embryos of the collected soybean samples, we used the Magnetic Bead System of an automatic nucleic acid extraction device (Hamilton Microlab Star®, Hamilton Co., Reno, NV, USA). The NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA) was used to measure the concentration and purity of the extracted genomic DNA. Finally, the purity was checked before PCR analysis using the 260/230 nm and 260/280 nm absorbance ratios to ensure that it fell within the range of 1.8-2.0.

Selection of Test Varieties and Molecular Markers
The complete whole-genome sequencing data of the soybean Williams 82 cultivar reference genome and those of the 15 known domestic varieties (Daewon, Taekwang, Pungsannamul, Seonyu, Daepung, Sinhwa, Hwangkeum, Nampung, Cheonsang, Uram, Hwangkeumol, Saedanbaek, Pungwon, Cheongja1, and Cheongja3) were analyzed to compare dVB with a high SNP density. Based on the result, 11 InDel markers were selected to identify 16 standard samples.

Primer Preparation and PCR Analysis
The 11 selected InDel markers were used to prepare primers for allele-specific PCR, allowing easy differentiation based on the state of PCR amplification (Figure 1). The PCR mixture contained 40 ng of genomic DNA, 0.2 pmol of primer, and 10 µL of anti-HS Taq Premix (2× reaction buffer, 4 mM MgCl2, 0.5 mM dNTP, and 1 unit of Anti HS Taq DNA polymerase (TNT Research, Jeonju-si, South Korea)). PCR was performed with 10 min pre-denaturation at 94 °C, 30 s denaturation at 94 °C, 30 s annealing at 56 °C, and 10 min extension at 72 °C. The PCR reaction was terminated at 4 °C. Afterward, the amplified PCR product was loaded onto a 3% agarose gel with GoldView (SBS Genentech, Beijing, China) for 30 min electrophoresis at 200 V. The PCR product was confirmed using ultraviolet light. Furthermore, confirmation of the PCR product was carried out through QIAxcel electrophoresis (QIAGEN, Hilden, Germany).

Determination of Judgment Values
Marker 1 is a specific marker for soybean's endogenous genes, enabling the determination of whether the analyzed sample is soybean or not. In essence, successful amplification in PCR is essential for discerning the origin of soybeans. Therefore, marker 1 was exceptionally assigned a score of 1. Subsequently, markers 2 to 11 were scored as 2^n, with each marker obtaining the following points through allele-specific PCR amplification: marker 2 (2 points), marker 3 (4 points), marker 4 (8 points), marker 5 (16 points), marker 6 (32 points), marker 7 (64 points), marker 8 (128 points), marker 9 (256 points), marker 10 (512 points), and marker 11 (1024 points). The cumulative sum of scores from all amplified markers was then calculated to determine the overall judgment value. The scores calculated using 11 markers could theoretically allow for the discrimination of 2048 genetic diversity patterns based on judgment values. After checking the genetic polymorphism of the 16 standard samples, 1096 collected soybean samples were used to generate the origin discrimination tables.
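The scoring scheme above amounts to reading the 11 amplification results as bits of a binary number. The following is a minimal Python sketch of that arithmetic; the function names are hypothetical and are not taken from the study, and the only rule assumed is the one stated in the text (marker 1 scores 1, and each subsequent marker scores twice the previous one).

```python
from typing import Iterable, List

N_MARKERS = 11  # number of InDel markers described in the text


def marker_score(marker: int) -> int:
    """Score of one marker: marker 1 -> 1, marker 2 -> 2, ..., marker 11 -> 1024."""
    if not 1 <= marker <= N_MARKERS:
        raise ValueError(f"marker must be between 1 and {N_MARKERS}")
    return 2 ** (marker - 1)


def judgment_value(amplified: Iterable[int]) -> int:
    """Sum the scores of all markers that amplified (duplicates are ignored)."""
    return sum(marker_score(m) for m in set(amplified))


def amplified_markers(value: int) -> List[int]:
    """Recover which markers amplified from a judgment value (hypothetical helper)."""
    if not 0 <= value < 2 ** N_MARKERS:
        raise ValueError("judgment value out of range for 11 markers")
    return [m for m in range(1, N_MARKERS + 1) if value & marker_score(m)]


if __name__ == "__main__":
    # 655 is the judgment value reported for Seonyu and Hwangkeumol.
    print(amplified_markers(655))
    print(judgment_value([1, 2, 3, 4, 8, 10]))
```

Because every marker contributes a distinct power of two, each judgment value corresponds to exactly one amplification pattern, which is why 11 markers give 2^11 = 2048 possible patterns.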

Validation of the Origin Discrimination Table
The origin discrimination formula was validated based on judgment values using the 11 markers applied to 30 domestic and 30 imported soybeans. For the discrimination formula, the sensitivity, selectivity, and efficiency were estimated through qualitative analysis [16,17].
True Domestic Product (TD) indicates a domestic sample identified as domestic based on the discrimination result. False Domestic Product (FD) indicates an imported sample identified as domestic based on the discrimination result. True Foreign Product (TF) indicates an imported sample identified as foreign based on the discrimination result. False Foreign Product (FF) indicates a domestic sample identified as foreign based on the discrimination result. Sensitivity indicates the level at which the discrimination table can correctly identify a domestic sample. Similarly, selectivity indicates the level at which the discrimination table can correctly identify a foreign sample. To evaluate the prediction performance of the established discrimination table, the judgment values of the tested samples were applied to the table. The percentage of the discrimination of domestic samples as domestic was set as the domestic predictive rate and that of the discrimination of foreign samples as foreign was set as the foreign predictive rate.
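The explicit formulas for sensitivity, selectivity, and efficiency do not appear in the text as given here. The sketch below assumes the usual definitions for a qualitative two-class method (sensitivity = TD/(TD + FF), selectivity = TF/(TF + FD), efficiency = (TD + TF)/total); this is an assumption, although these definitions do reproduce the percentages reported in the results for the 1096 collected samples. The function name is hypothetical.

```python
from typing import Dict


def classification_metrics(td: int, fd: int, tf: int, ff: int) -> Dict[str, float]:
    """Return sensitivity, selectivity, and efficiency as percentages.

    td: domestic samples identified as domestic (True Domestic)
    fd: imported samples identified as domestic (False Domestic)
    tf: imported samples identified as foreign (True Foreign)
    ff: domestic samples identified as foreign (False Foreign)
    The formulas are the assumed standard definitions, not quoted from the paper.
    """
    domestic_total = td + ff
    imported_total = tf + fd
    if domestic_total == 0 or imported_total == 0:
        raise ValueError("both classes need at least one sample")
    return {
        "sensitivity": 100.0 * td / domestic_total,
        "selectivity": 100.0 * tf / imported_total,
        "efficiency": 100.0 * (td + tf) / (domestic_total + imported_total),
    }


if __name__ == "__main__":
    # Counts reported for the 11-marker analysis: 595 of 630 domestic samples
    # and 446 of 466 imported samples were assigned to the correct origin.
    metrics = classification_metrics(td=595, fd=466 - 446, tf=446, ff=630 - 595)
    for name, value in metrics.items():
        print(f"{name}: {value:.1f}%")
```

Under this reading, samples whose judgment value cannot be assigned to one origin are counted as misclassified, which is the stricter of the possible conventions.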

Validation of the Selected Molecular Markers
Seventeen molecular markers were selected from the InDel regions of dVBs that were predicted, through biodata analyses, to enable the identification of the 16 soybean varieties used as standard samples. The PCR amplification of the selected markers was checked against the Williams 82 reference. However, certain markers produced nonspecific reaction products and could not be used in allele-specific PCR, which relies on the presence or absence of PCR-amplified products for molecular markers. Ultimately, 11 molecular markers capable of allele-specific PCR were selected through verification, and these were used in tests to determine the origin discrimination (Table 1).

Multiplex Allele-Specific PCR Analysis
In previous studies, the interpretation of results obtained via PCR was complicated, as both the presence or absence of amplification of the InDel marker and the size difference in the PCR amplification products were considered. In this study, the allele-specific PCR including the InDel sequences allowed for an intuitive result interpretation based on the state of PCR amplification. Although the results can be confirmed more easily than with existing analysis methods, confirming the origin of one sample still takes much time and labor because PCR must be performed on 24 single soybean grains and confirmed via electrophoresis. To solve this problem, we arranged the 11 markers into six groups to enable multiplex PCR. To validate the feasibility of allele-specific PCR for InDel markers 1 to 11 at a single annealing temperature, amplification was assessed across a temperature range of 46 to 60 °C. Among these temperatures, we found that only the target product was amplified at 56 °C, without any non-specific PCR products (Figure 2). We attempted to confirm the 11 InDel markers via a single multiplex PCR; however, because of the size of the amplified products and non-specific reactions, we optimized a total of six groups (Figure 3). Accordingly, we were able to significantly reduce the time needed to determine the geographical origin through multiplex allele-specific PCR.

Determination of Judgment Values for Standard and Test Samples
Because a minimum sample of 24 single grains is required for 95% reliability, 200 g of the collected sample was evenly distributed with a grain spreader, and 24 single grains were ultimately sampled [18]. After DNA extraction, multiplex PCR analysis was performed on the samples using the 11 InDel markers that were assigned unique scores. Then, the scores given to the amplified markers were summed to obtain the judgment values.
First, we confirmed, through analysis, whether the 16 standard samples were clearly identified. Among the 16 varieties (15 domestic and 1 imported), 14 varieties (13 domestic and 1 imported) could be discriminated (Table 2). The two domestic varieties that could not be discriminated were Seonyu and Hwangkeumol, which shared an identical judgment value (655).
Second, to set the judgment values for a variety of soybean samples, 1096 soybean samples were collected to include the domestic soybeans cultivated in South Korea, those imported through the port of entry, and the foreign soybeans currently being distributed. As a result, domestic and imported soybean varieties could be classified based on 53 and 70 judgment values, respectively (Tables 3 and 4). These varieties had four overlapping judgment values (671, 1183, 1215, and 1695) that prevented origin discrimination across the corresponding domestic and imported soybeans. The aim of this study was to discriminate the origin of soybeans based on the variation of judgment values between domestic and imported varieties rather than to accurately identify the variety of soybeans. For the two domestic varieties that could not be discriminated in the standard samples, the judgment value (655) did not overlap with the judgment values of imported soybeans. The four judgment values that were the same for domestic and foreign products accounted for approximately 3.4% of the total judgment values, and the samples for which judgment was impossible accounted for approximately 5% of the total sample. However, if the foreign varieties with the same judgment value (1215) as Pungsannamul (a domestic variety) are excluded because they have obvious morphological differences (Figure 4), the proportion of judgment values for which no judgment can be made decreases from approximately 3.4% to approximately 2.5%. Accordingly, the proportion of samples that could not be determined can be reduced from approximately 5% to approximately 3.2%.
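The way a discrimination table handles overlapping judgment values can be illustrated with a short Python sketch. This is an assumed, simplified data structure rather than the study's implementation: the function names and the labels "indeterminate" and "unknown" are hypothetical, and the toy input uses only judgment values mentioned in the text (655 as a domestic-only value and the four overlapping values), not the full Tables 3 and 4.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_discrimination_table(samples: Iterable[Tuple[int, str]]) -> Dict[int, str]:
    """Map each observed judgment value to 'domestic', 'imported', or 'indeterminate'.

    samples: (judgment_value, origin) pairs with origin 'domestic' or 'imported'.
    A value seen for both origins cannot discriminate and is marked 'indeterminate'.
    """
    origins_by_value: Dict[int, Set[str]] = defaultdict(set)
    for value, origin in samples:
        origins_by_value[value].add(origin)
    return {
        value: next(iter(origins)) if len(origins) == 1 else "indeterminate"
        for value, origins in origins_by_value.items()
    }


def discriminate(table: Dict[int, str], value: int) -> str:
    """Look up a judgment value; values absent from the table are 'unknown'."""
    return table.get(value, "unknown")


if __name__ == "__main__":
    # Toy reference set built only from values named in the text.
    overlapping = [671, 1183, 1215, 1695]
    reference = [(655, "domestic")]
    reference += [(v, "domestic") for v in overlapping]
    reference += [(v, "imported") for v in overlapping]

    table = build_discrimination_table(reference)
    print(discriminate(table, 655))
    print(discriminate(table, 1215))
    print(sorted(v for v, label in table.items() if label == "indeterminate"))
```

A morphological check such as the one described for judgment value 1215 would then be an additional rule applied only to entries marked as indeterminate.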

Validation of the Origin Discrimination Table
In previous studies that used methods of inorganic content analysis, such as inductively coupled plasma-mass spectrometry or ED-XRF, the reported efficiency was 91.0-94.0%. Moreover, various statistical techniques were applied based on the concentration of 4-8 types of inorganic compounds [19][20][21]. Lee et al. [5] utilized FT-NIRS, a method of organic content analysis. The reported efficiency was 96.1-96.5% based on the difference in the absorbance spectra of organic compounds using the NIR. Through the gene-based analysis in this study, 595 out of 630 domestic samples were predicted as being domestic, with 94.4% sensitivity. In addition, 446 out of 466 imported samples were predicted as being foreign, with 95.7% selectivity. Finally, the efficiency of the origin discrimination table was 95.0%. By contrast, in the analysis based on 10 markers (markers 1-10), 506 of 630 domestic samples were predicted to be domestic, achieving a sensitivity of 80.3%, and 411 of 466 imported samples were predicted to be foreign, demonstrating a selectivity of 88.2%. The overall efficiency of the origin discrimination table was then 83.7%, a significant reduction compared to the analysis with 11 markers. Thus, the analytical method developed in this study exhibited a similar level of efficiency to conventional physicochemical analyses (95.0% vs. 91.0-96.5%) (Table 5). To determine the practicality of the origin discrimination table proposed in this study, soybean samples from 2019 to 2022, with accurately identified origins, were collected, and their judgment values were applied to the discrimination table.
The judgment values calculated using the 11 InDel markers selected for 30 domestic and 30 imported soybeans were applied to the discrimination table. As shown in Table 6, 29 out of 30 domestic soybeans were discriminated as domestic, showing a 96.7% domestic predictive rate. Furthermore, all 30 imported soybeans were discriminated as foreign, demonstrating a 100.0% foreign predictive rate. The domestic and foreign predictive rates of the discrimination table using 11 gene-based markers were thus high, at 98.3% on average. These results suggest that the discrimination table can effectively discriminate between domestic and imported soybean varieties, although domestic soybeans were discriminated slightly less well, with a lower sensitivity (94.4%) and domestic predictive rate (96.7%) than the selectivity (95.7%) and foreign predictive rate (100.0%). The conventional physicochemical analyses of organic and inorganic compounds for soybean origin discrimination are classification methods for variation in cultivation conditions using statistical techniques. As such, the predictive rate and efficiency may vary [22][23][24]. By contrast, gene-based analyses are independent of compositional changes depending on the cultivation conditions. Soybeans that have been developed to adapt to the climate and pests of each country are known not to grow successfully in other countries. In particular, the production areas of soybeans imported into Korea are limited to some areas, and these areas differ from Korea in latitude and longitude. Therefore, it is highly unlikely that imported soybean varieties can be grown domestically, which makes it possible to accurately determine the country of origin by utilizing the unique genetic characteristics of soybean varieties.
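The predictive rates quoted above follow directly from the validation counts. A small Python sketch of the arithmetic is given below; the function name is hypothetical, and the average is assumed to be the simple mean of the two rates, which matches the reported 98.3%.

```python
def predictive_rate(correct: int, total: int) -> float:
    """Percentage of samples of one origin assigned to that origin."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


# Validation counts from the text: 29 of 30 domestic and 30 of 30 imported
# soybeans were assigned to the correct origin.
domestic_rate = predictive_rate(29, 30)
foreign_rate = predictive_rate(30, 30)
mean_rate = (domestic_rate + foreign_rate) / 2  # assumed simple mean

print(f"domestic: {domestic_rate:.1f}%")
print(f"foreign: {foreign_rate:.1f}%")
print(f"mean: {mean_rate:.1f}%")
```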

Conclusions
In Korea, the price of soybeans differs by a factor of 3-5 depending on the soybean origin. Owing to this large price difference between domestic and imported soybeans, sellers are likely to misrepresent the country of origin of the soybeans. Currently, the origin of soybeans is identified using physicochemical analysis methods such as NIR and XRF to prevent origin misrepresentation; however, because existing physicochemical methods involve the crushing and testing of large amounts of samples, the country of origin can be identified only to a limited extent when domestic and imported soybeans are mixed. Consequently, sellers are taking advantage of this limitation and selling a mixture of soybeans from different origins. Therefore, we developed a gene-based analysis method that can identify the country of origin of soybeans on a grain-by-grain basis. In summary, the country-of-origin identification method developed with 11 InDel markers showed an efficiency of 95%, and the validation of the country-of-origin identification table showed a prediction rate of 98.3%, confirming that country-of-origin identification at the grain level has a high accuracy. Based on these results, the method developed in this study can be applied to identify the origin of soybeans, and, if combined with existing physicochemical methods, it is expected to prevent illegal acts, including the misrepresentation of origin, with a higher accuracy.
Furthermore, genetically modified soybeans with herbicide resistance are currently cultivated in several countries, and the use of herbicides, such as glyphosate, saflufenacil, and carfentrazone-ethyl, is rapidly increasing. Conversely, in Korea, the cultivation of genetically modified soybeans is prohibited, and the tolerance level for their unintentional presence is maintained below 3%. Therefore, by confirming the country of origin, it is possible to prevent the domestic distribution of genetically modified soybeans and mitigate exposure to harmful substances, such as herbicides [25,26].
Moreover, the genetics-based method for determining the origin of soybeans developed in this study can be applied in quality management during the food manufacturing process.In particular, it is expected to be applicable to intermediate stages (meju) or final products (soybean paste, natto, doenjang, gochujang, etc.) of fermented foods using soybeans.

Figure 1. Confirmation of genetic patterns using allele-specific polymerase chain reaction (PCR) markers. The letters G, C, and A represent nucleotides.

Figure 4. Morphological comparison between (A) Pungsannamul and (B) Chinese varieties with an overlapping judgment value (1215).

Table 1. A list of the eleven allele-specific PCR markers used for the origin discrimination of soybean in this study.

Table 2. Assignment of judgment values for 16 standard samples using 11 InDel markers.
* Standard samples with overlapping judgment values.

Table 3. Discrimination table of domestic soybean using judgment values of 11 markers for soybean origin.
* Overlapping judgment values for domestic and foreign soybean varieties.

Table 5. Classification performance parameters of gene-based analysis for discriminating Korean and imported soybeans.

Table 6. Validation results for the discrimination table using judgment values of 11 markers for soybeans.